Transfer manager checksums download #3629
Conversation
actualChecksum = getResult.GetChecksumCRC64NVME();
}

if (!actualChecksum.empty() && actualChecksum != handle->GetChecksum()) {
Do we need to check for !actualChecksum.empty()? Does it have to be non-empty or a simple mismatch is enough for us to error out?
You're right, !actualChecksum.empty() is redundant here.
} else if (!getResult.GetChecksumCRC64NVME().empty()) {
  partState->SetChecksum(getResult.GetChecksumCRC64NVME());
}
This logic of retrieving a checksum from an S3Outcome is repeated a few times here. Could be worth trying something like this (maybe with SFINAE) in the header file so we can centralize this logic:
template<typename OutcomeType>
Aws::String ExtractChecksum(const OutcomeType& outcome) {
const auto& result = outcome.GetResult();
if (!result.GetChecksumCRC32().empty()) {
return result.GetChecksumCRC32();
}
if (!result.GetChecksumCRC32C().empty()) {
return result.GetChecksumCRC32C();
}
if (!result.GetChecksumSHA256().empty()) {
return result.GetChecksumSHA256();
}
if (!result.GetChecksumSHA1().empty()) {
return result.GetChecksumSHA1();
}
if (!result.GetChecksumCRC64NVME().empty()) {
return result.GetChecksumCRC64NVME();
}
return {};
}
Then we can just call it:
auto actualChecksum = ExtractChecksum(getObjectOutcome);
partState->SetChecksum(ExtractChecksum(outcome));
Agreed - this refactoring makes sense to eliminate the duplication.
for (const auto& part : completedParts) {
  sortedParts.emplace_back(part.first, part.second->GetChecksum());
}
std::sort(sortedParts.begin(), sortedParts.end());
I could be COMPLETELY wrong here, but is the sorting needed, since XOR is associative? I might be missing some S3-specific context.
}
}

bool TransferManager::ValidateDownloadChecksum(const std::shared_ptr<TransferHandle>& handle) {
Whenever we return a bool from a function, we usually want an Aws::Utils::Outcome instead. Rather than simply returning a true/false failure, we should return a proper error describing what happened.
hashCalculator = Aws::MakeShared<Aws::Utils::Crypto::CRC64>("TransferManager");
}

auto fileStream = Aws::MakeShared<Aws::FStream>("TransferManager",
So we are opening the file, re-reading the entire stream, and then failing the download if the content differs. This gives me two concerns:
- A file with corrupted data is written to disk. The spec says:
Failure handling: The S3 Transfer Manager SHOULD download to a file in the same directory with .s3tmp.{uniqueIdentifier} (uniqueIdentifier <= 8 chars) suffix to indicate the file is in the process of being downloaded, but not yet complete. After all content has been written, rename the temporary file to the destination file using atomic rename operation. If there is an error, delete the temporary file. The S3 Transfer Manager SHOULD register cleanup handlers to remove the temporary file if the process terminates unexpectedly, if supported. The S3 Transfer Manager MUST NOT pre-validate available space because it can introduce race conditions and false failures.
This concerns me because we could leave behind a file that failed data integrity.
- We are reading the stream twice. On massive files (think 5 TB), this introduces real overhead. I think this is covered in the spec with:
If full object checksum is calculated inline while downloading source, the S3 Transfer Manager MAY buffer full parts in-memory while flushing is blocked on updating the running checksum object.
This really means that while we write out to the file we should also be updating the checksum.
Description of changes: Adding checksum validation on S3 download
Check all that apply:
Check which platforms you have built SDK on to verify the correctness of this PR.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.